In this work, we refer to the task of labeling news articles or blog posts with relevant tags from a predefined collection as document tagging. Accurate tagging of articles can benefit several downstream applications, such as recommendation and search. We propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec, two popular models for learning distributed representations of words and documents. In DocTag2Vec, we simultaneously learn the representations of words, documents, and tags in a joint vector space during training, and employ a simple $k$-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec deals directly with raw text instead of a provided feature vector. In addition, it offers advantages such as learning tag representations and the ability to handle newly created tags. To demonstrate the effectiveness of our approach, we conduct experiments on several datasets and show promising results against state-of-the-art methods.
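The prediction step described above, a $k$-nearest neighbor search over tag vectors in the joint space, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy embeddings, the function names `cosine` and `predict_tags`, and the choice of cosine similarity as the distance measure are all assumptions made for illustration, and the document vector is assumed to have been inferred already.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_tags(doc_vec, tag_vecs, k=2):
    """Return the k tags whose vectors are most similar to doc_vec."""
    ranked = sorted(
        tag_vecs,
        key=lambda tag: cosine(doc_vec, tag_vecs[tag]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical tag embeddings in a 3-dimensional joint space.
TAG_VECS = {
    "sports": [1.0, 0.0, 0.0],
    "politics": [0.0, 1.0, 0.0],
    "finance": [0.0, 0.0, 1.0],
}

# Hypothetical embedding of an unseen document.
doc = [0.9, 0.1, 0.3]
top_tags = predict_tags(doc, TAG_VECS, k=2)
```

In this sketch, a tag becomes retrievable as soon as its vector is added to `TAG_VECS`; how the embedding of a newly created tag is actually learned is not specified in the abstract.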